This project attempts to replicate the machine learning model described in Huynh et al. [1], which predicts lead exposure from drinking water in the city of Chicago. Lead-contaminated drinking water at the block level is defined as a binary variable indicating whether the majority of tests within a block show a lead concentration of at least 1 ppb. Their model estimates that approximately 75% of blocks have lead-contaminated drinking water. The strongest predictors of lead-contaminated drinking water were geographic area, block population, and number of buildings.
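To make the outcome definition concrete, here is a minimal sketch of how a block-level label could be derived from individual test results. The data frame `tests_toy`, the column `lead_ppb`, and the output column `PB_contaminated` are hypothetical names used only for illustration. Treating "majority" as strictly more than half of a block's tests is an assumption, and so is treating each row as one test result.

```r
library(dplyr)

# Hypothetical test-level data: one row per lead test, tagged with its block.
tests_toy <- data.frame(
  GEOID_block = c("A", "A", "A", "B", "B", "C"),
  lead_ppb    = c(0.5, 2.0, 3.1, 0.0, 0.9, 1.0)
)

block_labels <- tests_toy %>%
  group_by(GEOID_block) %>%
  summarize(
    PB_nTests = n(),                      # number of tests in the block
    PB_gt1Pct = mean(lead_ppb >= 1),      # share of tests at or above 1 ppb
    PB_contaminated = PB_gt1Pct > 0.5     # assumption: "majority" means more than half
  )

print(block_labels)
```

Block A has two of three tests at or above 1 ppb and is labeled contaminated, block B has none, and block C has a single test at exactly 1 ppb, which counts under the "at least 1 ppb" wording.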
Huynh et al. (and by extension, I) used the following data sources:
Data Sources

| Source | Measure | Extent |
|---|---|---|
| City of Chicago Department of Water Management Lead Test Data | Consecutive lead tests (ppb) | Anonymized to the block |
| Census | Block FIPS | Block |
| | Population (#) | Block |
| | Race/ethnicity (#): AIAN, Asian, Black, Hispanic, White | Block |
| American Community Survey | Block FIPS | Block |
| | Block group FIPS | Block group |
| | Population (#) | Block group, tract |
| | Race/ethnicity (#): AIAN, Asian, Black, Hispanic, White | Block group, tract |
| | Housing units (#) | Block group |
| | Median house value ($) | Block group |
| | Upper house value ($) | Block group |
| | Lower house value ($) | Block group |
| | Median homeowner costs ($) | Block group |
| | Education (#): High school, GED, <1 year of college, >1 year of college, Associate’s degree, Bachelor’s degree, Master’s degree, Professional school, Doctoral degree | Block group |
| | Poverty (#) | Block group |
| | English-only speakers (#) | Block group |
| | Computer access (#) | Block group |
| | Internet access (#) | Block group |
| | Complete plumbing facilities (#) | Block group |
| | Vacant housing (#) | Block group |
| | Owner-occupied (#) | Block group |
| | Children under 5 (#) | Block group |
| | Children under 10 (#) | Block group |
| | Children under 18 (#) | Block group |
| Chicago Building Footprints | Median building age (years) | Block |
| | Building (#) | Block |
| | Max age (years) | Block |
| | Mean age (years) | Block |
| | Built after 1986 (#) | Block |
| Chicago Health Atlas | Community area | Community area |
| | Lead poisoning rate (%) | Community area |
| | Historical lead poisoning rate (%) | Community area |
| | Economic diversity index | Tract |
| | Hardship index | Tract |
| | Social vulnerability index | Tract |
| | Major crime (#) | Tract |
| | Eviction rate (%) | Tract |
| | Fine particulate matter concentration (\(\mu\text{g/m}^3\)) | |
Download boundaries at various levels: the city, community areas, and census blocks.
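```r
library(sf)      # st_read(), st_transform()
library(dplyr)   # %>%
library(tigris)  # blocks()

# City boundary, transformed to NAD83 (EPSG:4269)
chicagoBoundaries <- st_read("https://data.cityofchicago.org/api/geospatial/qqq8-j68g?fourfour=qqq8-j68g&cacheBust=1712775952&date=20240411&accessType=DOWNLOAD&method=export&format=GeoJSON") %>%
  st_transform("EPSG:4269")

# Community area boundaries, transformed to the same CRS
chicagoCommunityAreas <- st_read("https://data.cityofchicago.org/api/geospatial/cauq-8yn6?method=export&format=GeoJSON") %>%
  st_transform("EPSG:4269")

# Census blocks for Cook County, Illinois
cookBlocks <- blocks(state = "IL", county = "Cook")
```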
Process
Intersect the census blocks with the Chicago boundary to keep only the blocks within the city. Mutate the GEOIDs to create complete block, block group, and tract GEOIDs.
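The GEOID step can be sketched as follows, relying on the standard Census layout in which a 15-digit block GEOID is state (2 digits), county (3), tract (6), and block (4), and the block group is identified by the first digit of the block code. The example GEOIDs and the column names `GEOID_block` and `GEOID_blockGroup` are illustrative assumptions (only `GEOID_tract` appears in the plotting code), and this sketch covers only the GEOID derivation, not the spatial intersection.

```r
# Hypothetical 15-digit block GEOIDs in Cook County, IL (state 17, county 031).
blocks_toy <- data.frame(
  GEOID_block = c("170310101001000", "170318392002011"),
  stringsAsFactors = FALSE
)

# Block group = first 12 characters; tract = first 11 characters.
blocks_toy$GEOID_blockGroup <- substr(blocks_toy$GEOID_block, 1, 12)
blocks_toy$GEOID_tract      <- substr(blocks_toy$GEOID_block, 1, 11)

print(blocks_toy)
```

Keeping the GEOIDs as character strings matters here: converting them to numbers would drop leading zeros in other states and make the prefixes unreliable.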
Remove buildings with empty geometries or empty street names. Spatially join the buildings to blocks, then summarize building characteristics at the block level.
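The building steps above can be sketched on toy data as shown below. Everything in it is an assumption made for illustration: the toy geometries (which carry no CRS), the columns `street_name` and `year_built`, the helper `unit_square()`, the reference year used to compute age, and the output names `bldAge_median` and `bldAge_max`. The names `bld_number`, `bldAge_mean`, and `bldAge_nAft86` follow the variables used in the plotting code.

```r
library(sf)
library(dplyr)

# Hypothetical helper: a unit square with its lower-left corner at (x0, 0).
unit_square <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# Two toy "blocks" and five toy "buildings" (points), with no CRS set.
blocks_toy <- st_sf(
  GEOID_block = c("A", "B"),
  geometry = st_sfc(unit_square(0), unit_square(2))
)
buildings_toy <- st_sf(
  street_name = c("MAIN", "MAIN", NA, "OAK", "ELM"),
  year_built  = c(1920, 1990, 1950, 1975, 1960),
  geometry = st_sfc(
    st_point(c(0.2, 0.2)), st_point(c(0.8, 0.5)), st_point(c(0.5, 0.5)),
    st_point(c(2.5, 0.5)), st_point()   # the last geometry is empty
  )
)

reference_year <- 2024  # assumption: the year against which age is measured

# 1. Drop buildings with empty geometries or missing/empty street names.
keep <- !st_is_empty(buildings_toy) &
  !is.na(buildings_toy$street_name) &
  buildings_toy$street_name != ""
buildings_clean <- buildings_toy[keep, ]

# 2. Spatially join each building to the block that contains it.
# 3. Summarize building characteristics per block.
block_buildings <- buildings_clean %>%
  mutate(age = reference_year - year_built) %>%
  st_join(blocks_toy, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(GEOID_block) %>%
  summarize(
    bld_number    = n(),
    bldAge_median = median(age),
    bldAge_mean   = mean(age),
    bldAge_max    = max(age),
    bldAge_nAft86 = sum(year_built > 1986)
  )

print(block_buildings)
```

With real building footprints (polygons rather than points), a footprint can straddle a block boundary, so the choice of join predicate (for example `st_within` versus `st_intersects`, or joining on centroids) becomes a design decision the sketch does not settle.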
```r
# Number of tests vs. test-weighted share of tests above 1 ppb, by tract.
# Tracts with identical (n, gt1pct) pairs are collapsed; point size and color
# show how many tracts share each pair.
data %>%
  group_by(GEOID_tract) %>%
  summarize(
    n = sum(PB_nTests, na.rm = TRUE),
    gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE) / n
  ) %>%
  group_by(n, gt1pct) %>%
  summarize(nTract = n()) %>%
  ggplot(aes(x = n, y = gt1pct, color = nTract, size = nTract)) +
  geom_point() +
  labs(
    title = "Number of Tests vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
    x = "Tests (#)",
    y = "Tests with Greater than 1 PPB Lead Detected (%)"
  ) +
  scale_color_gradient(low = "steelblue1", high = "black") +
  scale_y_continuous(labels = scales::percent) +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "none"
  )
```
```r
# Tests per 100 residents vs. test-weighted share of tests above 1 ppb, by tract.
data %>%
  group_by(GEOID_tract) %>%
  summarize(
    # population_tractE[1] takes the first value of the tract population within each tract
    nPer100People = 100 * sum(PB_nTests, na.rm = TRUE) / population_tractE[1],
    gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE) / sum(PB_nTests, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = nPer100People, y = gt1pct)) +
  geom_point(color = "steelblue1") +
  labs(
    title = "Number of Tests per 100 Residents vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
    x = "Tests Per 100 Residents (#)",
    y = "Tests with Greater than 1 PPB Lead Detected (%)"
  ) +
  scale_y_continuous(labels = scales::percent) +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "none"
  )
```
We can see that residents on the north side have disproportionately low rates of positive lead tests compared with the rest of Chicago.
```r
# Mean building age (weighted by building count) vs. test-weighted share of
# tests above 1 ppb, by tract.
data %>%
  group_by(GEOID_tract) %>%
  summarize(
    bldAge_mean = sum(bldAge_mean * bld_number, na.rm = TRUE) / sum(bld_number, na.rm = TRUE),
    PB_gt1Pct = sum(PB_gt1Pct * PB_nTests, na.rm = TRUE) / sum(PB_nTests, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = bldAge_mean, y = PB_gt1Pct)) +
  geom_point(color = "steelblue1") +
  labs(
    title = "Mean Building Age vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
    x = "Mean Building Age (years)",
    y = "Tests with Greater than 1 PPB Lead Detected (%)"
  ) +
  scale_y_continuous(labels = scales::percent) +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "none"
  )
```
```r
# Share of buildings built in or before 1986 vs. test-weighted share of tests
# above 1 ppb, by tract.
data %>%
  group_by(GEOID_tract) %>%
  summarize(
    # 1 minus the share built after 1986, so the axis labels refer to
    # buildings built in or before 1986 (the original labels said "after")
    bldAge_nBfr86 = 1 - (sum(bldAge_nAft86, na.rm = TRUE) / sum(bld_number, na.rm = TRUE)),
    PB_gt1Pct = sum(PB_gt1Pct * PB_nTests, na.rm = TRUE) / sum(PB_nTests, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = bldAge_nBfr86, y = PB_gt1Pct)) +
  geom_point(color = "steelblue1") +
  labs(
    title = "Percent of Buildings Built in or before 1986 vs Percent of Tests with \nGreater than 1 PPB Lead Detected by Tract",
    x = "Buildings Built in or before 1986 (%)",
    y = "Tests with Greater than 1 PPB Lead Detected (%)"
  ) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "none"
  )
```
[1] B. Q. Huynh, E. T. Chin, and M. V. Kiang, “Estimated Childhood Lead Exposure From Drinking Water in Chicago,” JAMA Pediatrics, p. e240133, Mar. 2024, doi: 10.1001/jamapediatrics.2024.0133.